Querying SQL databases

Your task is to analyze game and player statistics from the NBA 2008-2009 season provided as a relational database of play-by-play data.

You can use jupyter to execute the queries. To write and test the queries, use the web interface at http://pollux.vo.elte.hu/basketball/. Copy queries into the worksheet for submission.

You can compare your results with https://en.wikipedia.org/wiki/2008%E2%80%9309_NBA_season

Setup pymssql package

In [1]:
! pip install --user pymssql
Requirement already satisfied: pymssql in /home/dvtulf/.local/lib/python3.7/site-packages (2.1.5)
In [1]:
import pymssql
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
In [2]:
# This opens a connection to the MS SQL server
conn = pymssql.connect(server='kooplex-temp.elte.hu',
                       port=1433,
                       user='basketball_user', 
                       password='sayhaTU56#@dfsS',
                       database='Basketball')
# This closes it 
# conn.close()
In [3]:
# We need a cursor that contains the state of our query(s)
cursor = conn.cursor()
In [4]:
# A test query
res = cursor.execute('select * from shot;')
print(cursor.fetchone())
(1, 2, 18, 'LAL', 208, None, None, 23, 13)

Excercise 1 - Simple queries

a) List all teams playing in the league during the season

b) List teams in alphabetic order but division by division

c) Count the number of persons/players in the league

In [5]:
# queries
tables = pd.read_sql('select * from INFORMATION_SCHEMA.tables', conn)
columns = pd.read_sql('select * from INFORMATION_SCHEMA.columns', conn)
cols = columns[['TABLE_NAME','COLUMN_NAME', 'DATA_TYPE']]

# Print tables with holoviews, so it is easier to explore each table's structure
key_dimensions = ['TABLE_NAME']
value_dimensions = ['COLUMN_NAME', 'DATA_TYPE']
hv_data = hv.Table(cols, key_dimensions, value_dimensions)
#t1 = hv_data.to.table(['TABLE_NAME'], [] )
table_names = hv.Div(data="<br>".join(cols['TABLE_NAME'].unique()))
t2 = hv_data.to.table(['COLUMN_NAME', 'DATA_TYPE'])

tt = table_names + t2

tt
WARNING:param.Table: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Layout: Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.AdjointLayout: Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.DivPlot: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.AdjointLayout: Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.TablePlot: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
Out[5]:
In [7]:
###### a)
team = pd.read_sql("SELECT * FROM team;",conn)
team.head(10)
Out[7]:
id division_id name
0 ATL SE Atlanta Hawks
1 BOS AT Boston Celtics
2 CHA SE Charlotte Bobcats
3 CHI CE Chicago Bulls
4 CLE CE Cleveland Cavaliers
5 DAL SW Dallas Mavericks
6 DEN NW Denver Nuggets
7 DET CE Detroit Pistons
8 GSW PA Golden State Warriors
9 HOU SW Houston Rockets
In [8]:
###### b)
team_alphabetic = pd.read_sql("""SELECT * 
                              FROM team
                              ORDER BY division_id, name;""",conn)
team_alphabetic.head(10)
Out[8]:
id division_id name
0 BOS AT Boston Celtics
1 NJN AT New Jersey Nets
2 NYK AT New York Knicks
3 PHI AT Philadelphia 76ers
4 TOR AT Toronto Raptors
5 CHI CE Chicago Bulls
6 CLE CE Cleveland Cavaliers
7 DET CE Detroit Pistons
8 IND CE Indiana Pacers
9 MIL CE Milwaukee Bucks
In [6]:
###### c)
players = pd.read_sql("""SELECT COUNT(person_id) 
                              FROM player;""",conn)

persons = pd.read_sql("""SELECT COUNT(*)
                        FROM person;""", conn)
persons, players #it is not the same
Out[6]:
(      
 0  444,
       
 0  515)

Excercise 2 - Simple joins

a) List teams by division, including the division name

b) List players by team including the uniform number

In [14]:
tt
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension:Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.Dimension: Use method 'get_param_values' via param namespace 
WARNING:param.Dimension:Use method 'get_param_values' via param namespace 
WARNING:param.Layout: Use method 'get_param_values' via param namespace 
WARNING:param.Layout:Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.AdjointLayout: Use method 'get_param_values' via param namespace 
WARNING:param.AdjointLayout:Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.DivPlot: Use method 'params' via param namespace 
WARNING:param.DivPlot:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.AdjointLayout: Use method 'get_param_values' via param namespace 
WARNING:param.AdjointLayout:Use method 'get_param_values' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
WARNING:param.TablePlot: Use method 'params' via param namespace 
WARNING:param.TablePlot:Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass: Use method 'params' via param namespace 
WARNING:param.ParameterizedMetaclass:Use method 'params' via param namespace 
Out[14]:
In [12]:
###### a)
team_by_div = pd.read_sql("""
            SELECT team.name, division.name
            FROM division
                LEFT JOIN team
                ON division.id=team.division_id
            ORDER BY division.name;
            """, conn)
team_by_div.head(10)
Out[12]:
name name
0 Boston Celtics Atlantic Division
1 New Jersey Nets Atlantic Division
2 New York Knicks Atlantic Division
3 Philadelphia 76ers Atlantic Division
4 Toronto Raptors Atlantic Division
5 Chicago Bulls Central Division
6 Cleveland Cavaliers Central Division
7 Detroit Pistons Central Division
8 Indiana Pacers Central Division
9 Milwaukee Bucks Central Division
In [43]:
###### b)
pd.read_sql("""
            SELECT person.name, player.number, team.name
            FROM person
                LEFT JOIN player
                ON player.person_id=person.id
                    LEFT JOIN team
                    ON team.id=player.team_id
                    
            ORDER BY team.name, person.name;""",conn)
Out[43]:
name number name
0 Acie Law 4 Atlanta Hawks
1 Al Horford 15 Atlanta Hawks
2 Joe Johnson 2 Atlanta Hawks
3 Josh Smith 5 Atlanta Hawks
4 Mario West 6 Atlanta Hawks
... ... ... ...
510 Javaris Crittenton 8 Washington Wizards
511 Juan Dixon 12 Washington Wizards
512 Mike James 13 Washington Wizards
513 Nick Young 1 Washington Wizards
514 Oleksiy Pecherov 14 Washington Wizards

515 rows × 3 columns

Excercise 3 - Simple aggregation

a) Count number of persons playing for a given team

b) List all teams along with the number of players

c) Count the number of teams a given person played for

d) List all players along with the number of teams they played for

e) List players who played for more than two teams during the season

In [32]:
###### a) Let this given team be Atlanta Hawks
pd.read_sql("""
            SELECT COUNT( DISTINCT person_id ) num_of_players
            FROM player
            WHERE team_id = 'ATL'

            """,conn)
Out[32]:
num_of_players
0 15
In [29]:
###### b) 
b = pd.read_sql("""
            SELECT COUNT(id) num_of_players, team_id
            FROM player            
            GROUP BY team_id
            """,conn)
print(b.num_of_players.sum())
b
515
Out[29]:
num_of_players team_id
0 15 ATL
1 16 BOS
2 24 CHA
3 19 CHI
4 16 CLE
5 16 DAL
6 17 DEN
7 15 DET
8 17 GSW
9 17 HOU
10 14 IND
11 19 LAC
12 16 LAL
13 15 MEM
14 18 MIA
15 17 MIL
16 17 MIN
17 15 NJN
18 15 NOH
19 23 NYK
20 20 OKC
21 16 ORL
22 13 PHI
23 18 PHX
24 15 POR
25 21 SAC
26 20 SAS
27 19 TOR
28 15 UTA
29 17 WAS
In [33]:
###### c)
pd.read_sql("""
            SELECT person_id, COUNT(team_id)
            FROM player
            GROUP BY person_id
            """, conn)
Out[33]:
person_id
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
... ... ...
439 440 1
440 441 1
441 442 2
442 443 1
443 444 1

444 rows × 2 columns

In [41]:
###### d)
pd.read_sql("""
            SELECT player.person_id, person.name, COUNT(team_id)
            FROM player
                INNER JOIN person
                ON player.person_id = person.id
            GROUP BY player.person_id, person.name
            """, conn)
Out[41]:
person_id name
0 1 Aaron Brooks 1
1 2 Aaron Gray 1
2 3 Acie Law 1
3 4 Adam Morrison 2
4 5 Adonal Foyle 2
... ... ... ...
439 440 Yao Ming 1
440 441 Yi Jianlian 1
441 442 Zach Randolph 2
442 443 Zaza Pachulia 1
443 444 Zydrunas Ilgauskas 1

444 rows × 3 columns

In [40]:
###### e)
pd.read_sql("""
            SELECT player.person_id, person.name, COUNT(team_id)
            FROM player
                INNER JOIN person
                ON player.person_id = person.id
            GROUP BY player.person_id, person.name
            HAVING COUNT(team_id) > 2
            """, conn)
Out[40]:
person_id name
0 76 Cheikh Samb 3
1 134 Drew Gooden 3
2 414 Tim Thomas 3

Excercise 4 - More complex queries with set operations and common table expressions

a) Pick a game and list all field shots and free throw attempts, including points

b) Combine the previous two into a single table and only show successful attempts

c) Take previous query and sum all points by team so that end result is available

d) Remove filter on game and get end result of each game during season

e) Modify the previous query to return a single line per game, with home team and away team points

In [49]:
###### a) All shots and attempts

pd.read_sql("""
            SELECT * FROM shot
                LEFT JOIN point
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)  
            WHERE shot.game_id = 1
            """,conn)
Out[49]:
game_id cntr type team_id player_id assist_player_id block_player_id x y game_id cntr points
0 1 2 18 LAL 208 NaN NaN 23.0 13.0 NaN NaN NaN
1 1 4 24 LAL 211 NaN NaN 25.0 6.0 NaN NaN NaN
2 1 6 18 LAL 208 NaN NaN 25.0 10.0 1.0 6.0 2.0
3 1 10 20 POR 412 413.0 NaN 36.0 21.0 1.0 10.0 2.0
4 1 11 26 LAL 211 NaN NaN 30.0 21.0 1.0 11.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ...
158 1 407 1 LAL 218 217.0 NaN 41.0 29.0 1.0 407.0 3.0
159 1 408 20 POR 417 NaN NaN 13.0 15.0 NaN NaN NaN
160 1 413 20 LAL 216 NaN NaN 22.0 18.0 1.0 413.0 2.0
161 1 414 24 POR 409 NaN 216.0 25.0 6.0 NaN NaN NaN
162 1 416 10 LAL 218 NaN NaN 25.0 6.0 1.0 416.0 2.0

163 rows × 12 columns

In [79]:
##### a) All freethrows and attempts
pd.read_sql("""
            SELECT * FROM freethrow
                LEFT JOIN point
                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
            WHERE freethrow.game_id = 1
            """,conn)
Out[79]:
game_id cntr reason team_id player_id num outof game_id cntr points
0 1 8 3 LAL 208 1 1 1.0 8.0 1.0
1 1 89 3 POR 410 1 1 NaN NaN NaN
2 1 106 3 LAL 218 1 2 1.0 106.0 1.0
3 1 107 3 LAL 218 2 2 1.0 107.0 1.0
4 1 120 3 LAL 220 1 1 NaN NaN NaN
5 1 155 3 POR 418 1 2 1.0 155.0 1.0
6 1 157 3 POR 418 2 2 1.0 157.0 1.0
7 1 169 3 POR 420 1 2 NaN NaN NaN
8 1 171 3 POR 420 2 2 NaN NaN NaN
9 1 200 3 POR 421 1 2 1.0 200.0 1.0
10 1 201 3 POR 421 2 2 1.0 201.0 1.0
11 1 236 3 LAL 213 1 1 NaN NaN NaN
12 1 248 3 LAL 213 1 2 1.0 248.0 1.0
13 1 250 3 LAL 213 2 2 1.0 250.0 1.0
14 1 254 3 LAL 213 1 2 1.0 254.0 1.0
15 1 256 3 LAL 213 2 2 1.0 256.0 1.0
16 1 269 3 LAL 211 1 2 1.0 269.0 1.0
17 1 270 3 LAL 211 2 2 1.0 270.0 1.0
18 1 292 3 LAL 220 1 2 1.0 292.0 1.0
19 1 294 3 LAL 220 2 2 1.0 294.0 1.0
20 1 298 3 LAL 220 1 2 NaN NaN NaN
21 1 300 3 LAL 220 2 2 1.0 300.0 1.0
22 1 322 4 POR 418 1 1 1.0 322.0 1.0
23 1 330 3 LAL 218 1 2 1.0 330.0 1.0
24 1 332 3 LAL 218 2 2 1.0 332.0 1.0
25 1 349 3 LAL 215 1 1 1.0 349.0 1.0
26 1 371 3 POR 410 1 2 NaN NaN NaN
27 1 373 3 POR 410 2 2 NaN NaN NaN
28 1 385 3 POR 415 1 2 NaN NaN NaN
29 1 387 3 POR 415 2 2 1.0 387.0 1.0
30 1 411 3 POR 409 1 2 1.0 411.0 1.0
31 1 412 3 POR 409 2 2 1.0 412.0 1.0
In [65]:
###### b)
pd.read_sql("""
            SELECT * FROM point
                LEFT JOIN shot
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
            
            
                LEFT JOIN freethrow
                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)  
            WHERE point.game_id = 1
            """,conn)
Out[65]:
game_id cntr points game_id cntr type team_id player_id assist_player_id block_player_id x y game_id cntr reason team_id player_id num outof
0 1 6 2 1.0 6.0 18.0 LAL 208.0 NaN None 25.0 10.0 NaN NaN NaN None NaN NaN NaN
1 1 8 1 NaN NaN NaN None NaN NaN None NaN NaN 1.0 8.0 3.0 LAL 208.0 1.0 1.0
2 1 10 2 1.0 10.0 20.0 POR 412.0 413.0 None 36.0 21.0 NaN NaN NaN None NaN NaN NaN
3 1 11 2 1.0 11.0 26.0 LAL 211.0 NaN None 30.0 21.0 NaN NaN NaN None NaN NaN NaN
4 1 14 2 1.0 14.0 20.0 LAL 208.0 213.0 None 15.0 17.0 NaN NaN NaN None NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 1 407 3 1.0 407.0 1.0 LAL 218.0 217.0 None 41.0 29.0 NaN NaN NaN None NaN NaN NaN
85 1 411 1 NaN NaN NaN None NaN NaN None NaN NaN 1.0 411.0 3.0 POR 409.0 1.0 2.0
86 1 412 1 NaN NaN NaN None NaN NaN None NaN NaN 1.0 412.0 3.0 POR 409.0 2.0 2.0
87 1 413 2 1.0 413.0 20.0 LAL 216.0 NaN None 22.0 18.0 NaN NaN NaN None NaN NaN NaN
88 1 416 2 1.0 416.0 10.0 LAL 218.0 NaN None 25.0 6.0 NaN NaN NaN None NaN NaN NaN

89 rows × 19 columns

In [85]:
###### c)
pd.read_sql("""
            SELECT SUM(point.points), CONCAT(shot.team_id, freethrow.team_id) FROM point
                LEFT JOIN shot
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
            
                LEFT JOIN freethrow
                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)  
            WHERE point.game_id = 1
            
            GROUP BY CONCAT(shot.team_id, freethrow.team_id)
            """,conn)
Out[85]:
0 96 LAL
1 76 POR
In [155]:
###### d)
pd.read_sql("""
            SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id, point.game_id 
            FROM point
                LEFT JOIN shot
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
            
                LEFT JOIN freethrow
                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)  
            
            GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id
            ORDER BY point.game_id
            """,conn)
Out[155]:
points team_id game_id
0 96 LAL 1
1 76 POR 1
2 95 MIL 2
3 108 CHI 2
4 90 BOS 3
... ... ... ...
2347 85 LAC 1174
2348 90 ATL 1175
2349 98 MEM 1175
2350 107 WAS 1176
2351 115 BOS 1176

2352 rows × 3 columns

In [213]:
'''e = pd.read_sql("""
            SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id, point.game_id, home_team_id, away_team_id
            FROM point
                
                LEFT JOIN shot
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
            
                LEFT JOIN freethrow
                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
                
                LEFT JOIN game
                ON (game.id = point.game_id 
                    AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id 
                        OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))
                
            GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id
            ORDER BY point.game_id
            """,conn)

'''
Out[213]:
'e = pd.read_sql("""\n            SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id, point.game_id, home_team_id, away_team_id\n            FROM point\n                \n                LEFT JOIN shot\n                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)\n            \n                LEFT JOIN freethrow\n                ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)\n                \n                LEFT JOIN game\n                ON (game.id = point.game_id \n                    AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id \n                        OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))\n                \n            GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id\n            ORDER BY point.game_id\n            """,conn)\n\n'
In [212]:
###### e) Away and home team points in one line
e = pd.read_sql("""
                SELECT A.game_id , A.points away_team_points, A.team_id away_team_id,
                                   B.points home_team_points, B.team_id home_team_id
                
                FROM
                    (SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id,
                        point.game_id, home_team_id, away_team_id
                    FROM point

                        LEFT JOIN shot
                        ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)

                        LEFT JOIN freethrow
                        ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)

                        LEFT JOIN game
                        ON (game.id = point.game_id 
                            AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id 
                                OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))

                    GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id) A,
                    
                    (SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id,
                        point.game_id, home_team_id, away_team_id
                    FROM point

                        LEFT JOIN shot
                        ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)

                        LEFT JOIN freethrow
                        ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)

                        LEFT JOIN game
                        ON (game.id = point.game_id 
                            AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id 
                                OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))

                    GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id) B
                
                WHERE (A.game_id = B.game_id) AND (A.team_id = A.away_team_id) AND (B.team_id = B.home_team_id)
                
                ORDER BY A.game_id
            """,conn)


e # I would be really glad if someone would tell me how could I make this query without copy-pasting the previous query 
  # for the self join
Out[212]:
game_id away_team_points away_team_id home_team_points home_team_id
0 1 76 POR 96 LAL
1 2 95 MIL 108 CHI
2 3 85 CLE 90 BOS
3 4 71 MEM 82 HOU
4 5 108 NOH 103 GSW
... ... ... ... ... ...
1171 1172 84 HOU 95 DAL
1172 1173 97 SAC 90 MIN
1173 1174 126 OKC 85 LAC
1174 1175 90 ATL 98 MEM
1175 1176 107 WAS 115 BOS

1176 rows × 5 columns

In [215]:
e[e["home_team_id"] == "BOS"] #I looked up the results for Boston on Wiki if they match, and they do.
Out[215]:
game_id away_team_points away_team_id home_team_points home_team_id
2 3 85 CLE 90 BOS
20 21 80 CHI 96 BOS
65 66 89 MIL 101 BOS
88 89 87 TOR 94 BOS
104 105 102 ATL 103 BOS
112 113 94 DEN 85 BOS
144 145 86 NYK 94 BOS
157 158 80 DET 98 BOS
210 211 111 GSW 119 BOS
222 223 78 PHI 102 BOS
259 260 78 POR 93 BOS
307 308 77 NOH 89 BOS
332 333 91 UTA 100 BOS
384 385 91 PHI 110 BOS
441 442 83 WAS 108 BOS
482 483 89 HOU 85 BOS
525 526 109 TOR 115 BOS
537 538 86 NJN 118 BOS
569 570 87 PHX 104 BOS
617 618 100 DAL 124 BOS
637 638 100 SAC 119 BOS
664 665 101 MIN 109 BOS
690 691 110 LAL 109 BOS
710 711 105 SAS 99 BOS
822 823 99 IND 104 BOS
844 845 105 DET 95 BOS
871 872 94 CLE 105 BOS
896 897 86 ORL 79 BOS
924 925 92 MEM 102 BOS
969 970 108 MIA 112 BOS
1005 1006 77 LAC 90 BOS
1043 1044 84 OKC 103 BOS
1069 1070 109 CHA 111 BOS
1084 1085 92 ATL 104 BOS
1109 1110 104 NJN 106 BOS
1126 1127 98 MIA 105 BOS
1175 1176 107 WAS 115 BOS

Exercise 5 - Visualize data

Choose one from the following and visualize it!

a) Show the "hottest" positions on the floor where the players scored most! (Create a 2d histogram of the field)

b) What properties of a player are the most correlated with their scoring/assisting/rebound performance? Select some properties and create an interactive visualization for it, that shows the relation between these terms!

c) Create pie charts on the US map that shows for each teams' geolocation their average received/scored point ratio!

Don't forget to comment on the data sources (tables) you used and the method how you obtained the values!

In [45]:
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import seaborn as sns
In [46]:
shots = pd.read_sql("""
            SELECT *  FROM shot
                INNER JOIN point
                ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
            """,conn)
In [47]:
shots
Out[47]:
game_id cntr type team_id player_id assist_player_id block_player_id x y game_id cntr points
0 1 6 18 LAL 208 NaN None 25.0 10.0 1 6 2
1 1 10 20 POR 412 413.0 None 36.0 21.0 1 10 2
2 1 11 26 LAL 211 NaN None 30.0 21.0 1 11 2
3 1 14 20 LAL 208 213.0 None 15.0 17.0 1 14 2
4 1 15 20 POR 414 421.0 None 46.0 9.0 1 15 2
... ... ... ... ... ... ... ... ... ... ... ... ...
87079 1176 427 20 WAS 506 NaN None 47.0 14.0 1176 427 2
87080 1176 428 10 BOS 24 19.0 None 25.0 6.0 1176 428 2
87081 1176 434 45 BOS 18 NaN None 26.0 6.0 1176 434 2
87082 1176 442 1 BOS 17 24.0 None 2.0 6.0 1176 442 3
87083 1176 443 1 WAS 506 NaN None 41.0 26.0 1176 443 3

87084 rows × 12 columns

In [48]:
print(shots.isna().sum()) #We have missing values in x and y. 
shots_xy = shots[["x","y"]]
game_id                 0
cntr                    0
type                    0
team_id                 0
player_id               0
assist_player_id    37850
block_player_id     87084
x                     771
y                     771
game_id                 0
cntr                    0
points                  0
dtype: int64
In [49]:
shots_xy = shots_xy.dropna() #I'll just get rid of them until possible later analysis
In [50]:
plt.figure(figsize=[10,10])
a = plt.hist2d(shots_xy.x, shots_xy.y, bins = (50,50), norm = LogNorm()) 
plt.colorbar()
plt.grid()
#2D histogram of shots

Interactive visualization

In [51]:
import plotly.graph_objects as go
from PIL import ImageColor

colors = ["#440154", "#3D4D8A", "#25838E", "#37B878", "#BADE28"]

ticks = [0, 10**(-3), 10**(-2), 10**(-1), 1]


colorscale = [ [ticks[i], "rgb"+str(ImageColor.getcolor(colors[i], "RGB"))] for i in range(len(ticks))]

# I tried to mimic the matplotlib colorscale
# colorscale is like [[0, "rgb(255,255,255)"],
#                     ...,
#                     [1, "rgb(0,0,0)"]         ] 


fig = go.Figure(go.Heatmap(
        x = a[2],
        y = a[1],
        z = a[0],
        colorscale = colorscale))

fig.show()

Excercise 6 (optional) - a rather complex example

a) Print season results by division. For each team, calculate with a single_ query

  • number of games played
  • wins
  • losts
  • winning percentage

Compare with https://en.wikipedia.org/wiki/2008–09_NBA_season#By_division